
    On the Complexity of Spill Everywhere under SSA Form

    Compilation for embedded processors can be either aggressive (time-consuming cross-compilation) or just-in-time (embedded and usually dynamic). The heuristics used in dynamic compilation are highly constrained by limited resources, time and memory in particular. Recent results on the SSA form open promising directions for the design of new register allocation heuristics for embedded systems and especially for embedded compilation. In particular, heuristics based on tree scan with two separate phases (one for spilling, then one for coloring/coalescing) seem good candidates for designing memory-friendly, fast, and competitive register allocators. Still, partly because of its side effect on power consumption, the minimization of load and store overhead (the spilling problem) is an important issue. This paper provides an exhaustive study of the complexity of the "spill everywhere" problem in the context of the SSA form. Unfortunately, contrary to our initial hopes, many of the questions we raised lead to NP-completeness results. We identify some polynomial cases, but they are impractical in a JIT context. Nevertheless, they can give hints to simplify formulations for the design of aggressive allocators. (Comment: 10 pages)
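
    A minimal sketch of the flavor of problem the paper studies, not its formulation: assuming live ranges are intervals (which SSA-based allocators obtain on a linearized block) and that "spilling everywhere" evicts a variable's whole live range, a greedy heuristic spills cheapest-per-length variables until register pressure fits. All names and the cost model are illustrative.

```python
def max_pressure(intervals):
    """Peak number of simultaneously live intervals (register pressure)."""
    events = []
    for start, end in intervals.values():
        events.append((start, 1))
        events.append((end, -1))
    pressure, peak = 0, 0
    for _, delta in sorted(events):
        pressure += delta
        peak = max(peak, pressure)
    return peak

def spill_everywhere(intervals, cost, k):
    """Greedily evict whole live ranges until pressure fits in k registers."""
    live = dict(intervals)
    spilled = []
    while max_pressure(live) > k:
        # Spill the candidate with the lowest cost per unit of live range.
        v = min(live, key=lambda v: cost[v] / (live[v][1] - live[v][0] + 1))
        spilled.append(v)
        del live[v]
    return spilled

# Example: three variables, two registers.
ranges = {"a": (0, 9), "b": (2, 5), "c": (4, 8)}
costs = {"a": 10, "b": 1, "c": 5}
print(spill_everywhere(ranges, costs, k=2))  # ['b'] under these costs
```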

    Parameterized Construction of Program Representations for Sparse Dataflow Analyses

    Data-flow analyses usually associate information with control-flow regions. Informally, if these regions are as small as a point between two consecutive statements, we call the analysis dense. On the other hand, if these regions include many such points, we call it sparse. This paper presents a systematic method to build program representations that support sparse analyses. To pave the way to this framework, we clarify the literature on well-known intermediate program representations. We show that our approach, up to parameter choice, subsumes many of these representations, such as the SSA, SSI and e-SSA forms. In particular, our algorithms are faster, simpler and more frugal than the previous techniques used to construct SSI (Static Single Information) form programs. We produce intermediate representations isomorphic to Choi et al.'s Sparse Evaluation Graphs (SEG) for the family of data-flow problems that can be partitioned per variable. However, contrary to SEGs, we can handle, sparsely, problems that are not in this family.
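
    A hedged sketch of what "sparse" means here: facts are attached to variable definitions and propagated only along def-use edges, instead of being recomputed at every program point. The example below runs constant propagation over an SSA-like program given as explicit def-use chains; the representation and the UNDEF/NAC lattice naming are illustrative, not the paper's framework.

```python
UNDEF, NAC = "UNDEF", "NAC"   # no information yet / not a constant

def meet(a, b):
    if a == UNDEF: return b
    if b == UNDEF: return a
    return a if a == b else NAC

def sparse_constants(defs, uses):
    """defs: var -> (op, operands); uses: var -> vars whose defs read it."""
    value = {v: UNDEF for v in defs}
    work = list(defs)
    while work:
        v = work.pop()
        op, args = defs[v]
        if op == "const":
            new = args[0]
        elif op == "phi":                  # meet over incoming values
            new = UNDEF
            for a in args:
                new = meet(new, value[a])
        elif op == "add":
            x, y = (value[a] for a in args)
            new = NAC if NAC in (x, y) else (UNDEF if UNDEF in (x, y) else x + y)
        if new != value[v]:
            value[v] = new
            work.extend(uses.get(v, []))   # propagate only along def-use edges
    return value

# x = 1; y = 2; z = phi(x, x); w = z + y  ->  w is the constant 3.
defs = {"x": ("const", [1]), "y": ("const", [2]),
        "z": ("phi", ["x", "x"]), "w": ("add", ["z", "y"])}
uses = {"x": ["z"], "y": ["w"], "z": ["w"]}
print(sparse_constants(defs, uses))
```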

    Size of snow particles in a powder-snow avalanche

    The size of the snow particles involved in a powder-snow avalanche is a key parameter of the local dynamics of the flow. An experimental device has been built to collect snow particles within powder-snow avalanches. Snow particles have been captured in the powder-snow part of an avalanche triggered artificially at the experimental test site of the vallée de la Sionne. The collected particles have been photographed and the pictures digitized. An image analysis tool to evaluate the size of the collected particles has been developed for the purpose of this study. The obtained order of magnitude is 0.2 mm.
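
    A minimal sketch of the kind of image analysis the abstract describes: threshold a digitized photograph, label connected particles, and report an equivalent circular diameter per particle. The threshold value and the mm-per-pixel scale below are illustrative assumptions, not values from the experiment.

```python
import numpy as np
from scipy import ndimage

def particle_diameters_mm(image, threshold, mm_per_pixel):
    """Equivalent circular diameter of each bright blob, in millimetres."""
    mask = image > threshold                  # particles brighter than background
    labels, n = ndimage.label(mask)           # connected-component labelling
    areas = ndimage.sum(mask, labels, index=list(range(1, n + 1)))  # pixels
    # Diameter of the disc with the same area: d = 2 * sqrt(A / pi).
    return 2.0 * np.sqrt(np.asarray(areas) / np.pi) * mm_per_pixel

# Synthetic 2x2-pixel particle at an assumed scale of 0.1 mm/pixel.
img = np.zeros((8, 8))
img[3:5, 3:5] = 1.0
print(particle_diameters_mm(img, threshold=0.5, mm_per_pixel=0.1))  # ~0.23 mm
```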

    High-Level Synthesis of Pipelined FSM from Loop Nests

    Embedded systems raise many challenges in power, space and speed efficiency. The current trend is to build heterogeneous systems-on-a-chip with specialized processors and hardware accelerators. Generating a hardware accelerator from a computational kernel requires a deep reorganization of the code and the data. Typically, parallelism and memory bandwidth requirements are met thanks to fine-grain loop transformations. Unfortunately, the resulting control automaton is often very complex and eventually bounds the circuit frequency, which limits the benefits of the optimization. This is a major obstacle, which strongly limits the power of the code optimizations applicable by high-level synthesis tools. In this report, we propose an architecture of control automaton and a high-level synthesis algorithm which efficiently translate the control required by fine-grain loop optimizations. Unlike previous approaches, our control automaton can be pipelined at will, without any restriction. Hence, the frequency of the automaton can be made as high as desired. Experimental results on FPGA confirm that our control circuit can reach a high frequency with a reasonable resource consumption.
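
    A rough model of the underlying idea, not the report's architecture: the control automaton of a loop nest is essentially its next-iteration function, and if that function is pure combinational logic it can be cut into pipeline stages. The Python below only illustrates the next-state function for a rectangular two-level nest; bounds N and M are placeholders.

```python
def next_state(i, j, N, M):
    """Next iteration of `for i in range(N): for j in range(M)`, or None."""
    if j + 1 < M:
        return (i, j + 1)        # innermost counter increments
    if i + 1 < N:
        return (i + 1, 0)        # carry into the outer loop
    return None                  # loop nest finished

# Because next_state is a pure function of the current state, a synthesis
# tool may register its intermediate signals and pipeline it, so the
# automaton issues one iteration per cycle after a fixed latency.
state, order = (0, 0), []
while state is not None:
    order.append(state)
    state = next_state(*state, N=2, M=3)
print(order)   # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```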

    On Characterizing the Data Movement Complexity of Computational DAGs for Parallel Execution

    Technology trends are making the cost of data movement increasingly dominant, both in terms of energy and time, over the cost of performing arithmetic operations in computer systems. The fundamental ratio of aggregate data movement bandwidth to total computational power (also referred to as the machine balance parameter) in parallel computer systems is decreasing. It is therefore of considerable importance to characterize the inherent data movement requirements of parallel algorithms, so that the minimal architectural balance parameters required to support them on future systems can be well understood. In this paper, we develop an extension of the well-known red-blue pebble game to derive lower bounds on the data movement complexity of the parallel execution of computational directed acyclic graphs (CDAGs) on parallel systems. We model multi-node multi-core parallel systems, with the total physical memory distributed across the nodes (which are connected through some interconnection network) and with a multi-level shared cache hierarchy for processors within a node. We also develop new techniques for the lower-bound characterization of non-homogeneous CDAGs. We demonstrate the use of the methodology by analyzing the CDAGs of several numerical algorithms, to develop lower bounds on data movement for their parallel execution.
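
    A hedged sketch of the sequential red-blue pebble game that the paper extends: red pebbles model a fast memory of capacity S, blue pebbles model slow memory, and every transfer between the two is one unit of data movement. Replaying a given schedule and counting transfers yields an upper bound; the paper derives matching lower bounds for parallel, multi-level machines. The encoding below is illustrative.

```python
def replay(dag, inputs, schedule, S):
    """dag: node -> predecessors. Moves: ('load', v), ('store', v), ('compute', v)."""
    red, blue, io = set(), set(inputs), 0
    for move, v in schedule:
        if move == "load":                     # blue -> red
            assert v in blue
            red.add(v); io += 1
        elif move == "store":                  # red -> blue
            assert v in red
            blue.add(v); red.discard(v); io += 1
        elif move == "compute":                # all predecessors must be red
            assert all(p in red for p in dag[v])
            red.add(v)
        assert len(red) <= S, "fast memory overflow"
    return io

# Computing c = a + b with S = 3: the two input loads are the only transfers.
dag = {"c": ["a", "b"]}
sched = [("load", "a"), ("load", "b"), ("compute", "c")]
print(replay(dag, inputs={"a", "b"}, schedule=sched, S=3))  # 2
```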

    SSI Revisited

    The static single information (SSI) form, proposed by Ananian, then in a more general form by Singer, is an extension of the static single assignment (SSA) form. The latter is a well-established compiler intermediate representation that has been successfully used for numerous compiler analyses and optimizations. Several interesting results have also been shown for SSI concerning liveness analysis and the representation of live ranges of variables, which could make SSI appealing for just-in-time compilation. Unfortunately, previous literature on the SSI form is sparse and appears to be partly incorrect. Our paper corrects some of the mistakes that have been made. Our main result is a complete proof that, even for the most general definition of SSI, basic blocks, and thus program points, can be totally ordered so that the live ranges of variables correspond to intervals. This corrects the erroneous proof of Brisk and Sarrafzadeh.
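
    A small illustration of why this interval property matters, assuming the total order is already given (producing it for SSI is precisely what the paper proves possible): once live ranges are intervals, register assignment reduces to greedy interval coloring, the classic linear scan. The code below is a textbook sketch, not the paper's contribution.

```python
def linear_scan(intervals):
    """intervals: list of (var, start, end). Returns var -> register index."""
    active = []                                   # (end, register) pairs
    free, assignment = [], {}
    next_reg = 0
    for var, start, end in sorted(intervals, key=lambda t: t[1]):
        # Expire intervals that ended before this one starts.
        for e, r in list(active):
            if e < start:
                active.remove((e, r))
                free.append(r)
        reg = free.pop() if free else next_reg
        if reg == next_reg:
            next_reg += 1
        assignment[var] = reg
        active.append((end, reg))
    return assignment

print(linear_scan([("a", 0, 4), ("b", 1, 2), ("c", 3, 6)]))
# {'a': 0, 'b': 1, 'c': 1}: 'c' reuses b's register; two registers suffice.
```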

    Parallel Copy Motion

    Recent results on the static single assignment (SSA) form open promising directions for the design of new register allocation heuristics for just-in-time (JIT) compilation. In particular, heuristics based on tree scans with two decoupled phases, one for spilling, one for splitting/coloring/coalescing, seem good candidates for designing memory-friendly, fast, and competitive register allocators. Another class of register allocators, well suited for JIT compilation, are those based on linear scans. Most of them perform coalescing poorly but also do live-range splitting (mostly on control-flow edges) to avoid spilling. This leads to a large number of register-to-register copies inside basic blocks but also, implicitly, on critical edges, i.e., edges that flow from a block with several successors to a block with several predecessors. This paper presents a new back-end optimization that we call parallel copy motion. The technique is to move copy instructions in register-allocated code from one program point, possibly an edge, to another. In contrast with a classical scheduler that must preserve data dependences, our copy motion also permutes register assignments so that a copy can "traverse" all instructions of a basic block, except those with conflicting register constraints. Thus, parallel copies can be placed either where the schedule has some empty slots (for multiple-issue architectures), or where fewer copies are necessary because some variables are dead at that point. Moreover, at the cost of some code compensation (namely, the reverse of the copy), a copy can also be moved out of a critical edge. This provides a simple solution to avoid critical-edge splitting, especially useful when the compiler cannot split the edge, as is the case for abnormal edges. This compensation technique also enables the scheduling/motion of the copy in the successor or predecessor basic block. Experiments with the SPECint benchmark suite and our own benchmark suite show that we can now apply an SSA-based register allocator broadly: all procedures, even those with abnormal edges, can be treated. Simple strategies for moving copies from edges and locally inside basic blocks show significant average improvements (4% for SPECint and 3% for our suite), with no degradation. This leads us to believe that the approach is promising, and not only for improving coalescing in fast register allocators.
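
    A minimal sketch of the mechanics, under the simplifying assumption that the parallel copy is a permutation of registers and that no operand carries a register constraint: moving the copy above an instruction just rewrites that instruction's register operands through the permutation. Names are illustrative, not the paper's implementation.

```python
def move_copy_above(instruction, parallel_copy):
    """instruction: (opcode, [regs]); parallel_copy: dict dest <- src."""
    # Where does the value held in register r live after the copy executes?
    where = {src: dst for dst, src in parallel_copy.items()}
    opcode, regs = instruction
    # The instruction now runs *after* the copy, so every operand (read or
    # written) is renamed to the register its value will occupy.
    return (opcode, [where.get(r, r) for r in regs])

# A parallel copy swapping r1 and r2, originally placed after the add:
# moving it up renames the add's operands, preserving the computed values.
copy = {"r1": "r2", "r2": "r1"}        # r1 <- r2 and r2 <- r1, in parallel
print(move_copy_above(("add", ["r3", "r1", "r2"]), copy))
# ('add', ['r3', 'r2', 'r1'])
```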

    Beyond Reuse Distance Analysis: Dynamic Analysis for Characterization of Data Locality Potential

    Emerging computer architectures will feature drastically decreased flops/byte (ratio of peak processing rate to memory bandwidth) as highlighted by recent studies on Exascale architectural trends. Further, flops are getting cheaper while the energy cost of data movement is increasingly dominant. The understanding and characterization of data locality properties of computations is critical in order to guide efforts to enhance data locality. Reuse distance analysis of memory address traces is a valuable tool to perform data locality characterization of programs. A single reuse distance analysis can be used to estimate the number of cache misses in a fully associative LRU cache of any size, thereby providing estimates on the minimum bandwidth requirements at different levels of the memory hierarchy to avoid being bandwidth bound. However, such an analysis only holds for the particular execution order that produced the trace. It cannot estimate potential improvement in data locality through dependence-preserving transformations that change the execution schedule of the operations in the computation. In this article, we develop a novel dynamic analysis approach to characterize the inherent locality properties of a computation and thereby assess the potential for data locality enhancement via dependence-preserving transformations. The execution trace of a code is analyzed to extract a computational directed acyclic graph (CDAG) of the data dependences. The CDAG is then partitioned into convex subsets, and the convex partitioning is used to reorder the operations in the execution trace to enhance data locality. The approach enables us to go beyond reuse distance analysis of a single specific order of execution of the operations of a computation in characterizing its data locality properties. It can serve a valuable role in identifying promising code regions for manual transformation, as well as assessing the effectiveness of compiler transformations for data locality enhancement. We demonstrate the effectiveness of the approach using a number of benchmarks, including case studies where the potential shown by the analysis is exploited to achieve lower data movement costs and better performance. (Comment: ACM Transactions on Architecture and Code Optimization, 2014)
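
    A hedged sketch of the baseline the paper goes beyond: reuse distance analysis of a single trace. The reuse distance of an access is the number of distinct addresses touched since the previous access to the same address; a fully associative LRU cache of capacity C misses exactly on accesses with distance >= C, plus cold misses. The quadratic implementation below favors clarity over the usual tree-based O(N log N) version.

```python
def reuse_distances(trace):
    stack = []                         # most recently used address last
    for addr in trace:
        if addr in stack:
            yield len(stack) - 1 - stack.index(addr)
            stack.remove(addr)
        else:
            yield float("inf")         # cold miss: no previous access
        stack.append(addr)

def lru_misses(trace, capacity):
    return sum(1 for d in reuse_distances(trace) if d >= capacity)

trace = ["a", "b", "c", "a", "b", "c"]
print(lru_misses(trace, capacity=2))   # 6: every access misses in a 2-line cache
print(lru_misses(trace, capacity=3))   # 3: only the cold misses remain
```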

    A Polynomial Spilling Heuristic: Layered Allocation

    Register allocation is one of the most important, and one of the oldest, compiler optimizations. Its purpose is to map temporary variables either to machine registers or to main memory locations accessed through explicit load/store instructions. The latter option is referred to as spilling. This paper addresses the minimization of the spill code overhead, one of the difficult problems in register allocation. We devised a heuristic approach called layered allocation. It is rooted in the recent advances in SSA-based register allocation. As opposed to conventional incremental spilling approaches, our method incrementally allocates clusters of variables. We describe a new polynomial method, the layered-optimal allocator, and demonstrate its quasi-optimality on standard benchmarks and on two architectures.
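
    A hedged sketch of the layered idea only: rather than spilling one variable at a time, extract a whole cluster ("layer") of compatible live ranges per physical register and spill whatever no layer could take. The abstract's layered-optimal allocator solves each layer optimally; the greedy, cost-ordered extraction below is purely illustrative.

```python
def layered_alloc(intervals, cost, num_regs):
    """intervals: var -> (start, end). Returns (var -> reg, spilled vars)."""
    remaining = dict(intervals)
    assignment = {}
    for reg in range(num_regs):
        placed = []                            # intervals already in this layer
        # Costliest variables first: they are the worst candidates to spill.
        for v in sorted(remaining, key=lambda v: -cost[v]):
            s, e = remaining[v]
            if all(e < ps or pe < s for ps, pe in placed):
                assignment[v] = reg
                placed.append((s, e))
        for v in list(assignment):
            remaining.pop(v, None)
    return assignment, list(remaining)         # leftovers are spilled

ranges = {"a": (0, 9), "b": (2, 5), "c": (4, 8), "d": (6, 7)}
costs = {"a": 10, "b": 1, "c": 5, "d": 4}
print(layered_alloc(ranges, costs, num_regs=2))
# ({'a': 0, 'c': 1}, ['b', 'd']) under this greedy extraction policy
```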